Abstract: The problem of distributed access of a set of N on-off
channels by K < N users is considered. The channels are
slotted and modeled as independent but not necessarily identical 
alternating renewal processes. Each user decides to either observe 
or transmit at the beginning of every slot. A transmission is 
successful only if the channel is at the on state and there is 
only one user transmitting. When a user observes, it identifies 
whether a transmission would have been successful had it decided 
to transmit. A distributed learning and access policy referred to 
as alternating sensing and access (ASA) is proposed. It is shown 
that ASA achieves the throughput region of the optimal centrally 
coordinated scheme. Furthermore, it is shown that ASA has finite 
expected regret when it is compared with the optimal centralized 
scheme.

I. Introduction

THE problem considered in this paper, in its more general 
form, is related to distributed allocation of N independent 
and randomly available resources among K < N agents. 
By distributed allocation we mean that there is no central 
controller assigning resources to agents, and each agent acts on 
its own without communicating with others. We are interested 
in whether there is a distributed access policy that, through 
taking actions and learning from outcomes of actions, achieves 
the utilization of resources comparable to that of the optimal 
centralized allocation. 

We study the above in the context of multiaccess of N 
on-off random channels by K independent users. We are 
interested in whether any distributed learning and access policy 
is necessarily penalized by the lack of coordination. The 
performance measure of interest is throughput — the fraction 
of time that transmissions are successful. For a K user 
multiaccess system, the throughput is defined by a vector 
$r = (r_1, \ldots, r_K)$ where $r_i$ is the throughput of user i. If r can
be achieved by a central controller who assigns a channel to 
each user, we would like to achieve the same by letting users 
act independently on their own. 

If N = 1, the problem is a case of the classical random
access problem for which the celebrated slotted ALOHA
protocol can be used to achieve, asymptotically as $K \to \infty$,
the aggregated throughput (sum rate) of $e^{-1}$. Although the
optimal policy of distributed random access for this case
is unknown, it is well known [1] that distributed random
access cannot achieve the throughput of the optimal centralized
channel allocation scheme, which is 1.
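The $e^{-1}$ limit quoted above can be checked numerically. The sketch below is only an illustration under simplifying assumptions that are not taken from the paper: a single always-on channel, K saturated users, and a common transmission probability of 1/K. The function name `aloha_sum_throughput` is hypothetical.

```python
import math

def aloha_sum_throughput(num_users: int) -> float:
    """Sum throughput of slotted ALOHA on one always-on channel when each
    of num_users transmits independently with probability 1/num_users
    (an assumed, symmetric choice of transmission probability)."""
    p = 1.0 / num_users
    # A slot is successful when exactly one user transmits.
    return num_users * p * (1.0 - p) ** (num_users - 1)

for k in (2, 10, 100, 1000):
    print(k, round(aloha_sum_throughput(k), 4))
print("limit", round(math.exp(-1), 4))
```

The printed values decrease toward $e^{-1}$ as K grows, well below the centralized throughput of 1.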

We consider the case when N > K under a more general 
condition that channels are on and off stochastically. In prac- 
tice, for example, a channel may be unavailable due to fading. 
In the context of multiaccess of cognitive radios [2], a channel
may be unavailable because it is being used by another user 
of higher priority. 

If N > K, it is no longer obvious that distributed multiac- 
cess performs strictly worse than a centralized one. Because a 
user is not restricted to transmitting on one particular channel, 
it can search for opportunities elsewhere to avoid colliding 
with others. Intuitively, as N increases, conflicts among users 
diminish, and users may be able to orthogonalize themselves 
to avoid collision. Even when K = N, a user can learn 
where other users are transmitting and act accordingly to 
avoid collision. But learning in an uncertain environment is 
not perfect. It is also not obvious that mistakes in learning 
only cause negligible performance loss. 

Like many online learning problems in uncertain environments,
achieving the best performance requires a careful tradeoff
between exploration and exploitation. The result presented in
this paper is an instance of such a tradeoff, one that balances
sensing and transmission.

A. Summary of Results 

The detailed system model and assumptions are given in
Section II. Here we outline the context of the problem and
summarize our main results. We consider N independent but
not necessarily identical on-off slotted channels. Our results
apply to more general settings, but for the moment it is
sufficient to think of these channels as independent Bernoulli
channels, where $\eta_i$ is the probability that channel i is in the
on state. Let $\eta = (\eta_1, \ldots, \eta_N)$.

Let $r_i$ be the throughput of user i. For the K user
multiaccess system, define a throughput region

$$\mathcal{M} = \{(r_1, \ldots, r_N) \mid r_{(i)} < \eta_{(i)},\ 1 \le i \le N\}, \qquad (1)$$

where $r_{(i)}$ and $\eta_{(i)}$ are the ordered lists of $r_i$ and $\eta_i$,
respectively. When K = N, it should be obvious that, if there is
a central controller to assign channels to users, any point in
$\mathcal{M}$ can be achieved. If the central controller does not know
the channel states at the beginning of each slot, then $\mathcal{M}$ is the
maximum achievable region by a centralized multiple access
system.

For the distributed access system, let $r_i$ be the targeted
individual throughput for user i. A rate vector $r = (r_1, \ldots, r_K)$
is achievable if there exists a distributed multiaccess scheme
such that the long term average of successful packet delivery
of user i is $r_i$. The maximum achievable rate region $\mathcal{S}$ is the
union of all achievable rate vectors. Naturally, the throughput
region of the centrally coordinated scheme dominates that of
the distributed scheme, i.e., $\mathcal{S} \subseteq \mathcal{M}$.

The main result of this paper is to show that, under the
model specified in Section II,

$$\mathcal{S} = \mathcal{M}. \qquad (2)$$

This result is established by constructing a distributed learning 
and access policy executed independently by all users. The 
policy alternates between sensing and access periods, hence 
referred to as the alternating sensing and access (ASA) policy. 
We show that ASA achieves the throughput region of the 
optimal centralized scheme. 

The throughput result above is a direct consequence of a
more refined analysis based on the notion of regret: the
difference between the total number of successful transmissions
of the optimal centralized scheme, $S_i^*(n)$, and that of the
distributed scheme proposed here, $S_i(n)$. We show in Theorem 1
that, if $r \in \mathcal{M}$, the expected regret approaches a constant, i.e.,

$$E\bigl(S_i^*(n) - S_i(n)\bigr) = O(1). \qquad (3)$$

B. Related Work 

The problem of orthogonalizing multiple coexisting users in 
a distributed manner through learning and individual actions 
has been studied as a decentralized learning of multi-armed 
bandit (MAB) processes involving multiple players in 0. 
Essentially the same problem has also been studied for the
multiaccess problem in multiuser cognitive radio systems 0, 
0. There are similarities and significant differences between 
these "MAB approaches" and that considered in this paper. 

The MAB formulation involves independent random pro- 
cesses, often assumed independent and identically distributed 
(iid) in time but may also be Markovian. Each process 
is associated with an unknown deterministic parameter. Lai 
and Robbins considered the single user (deterministic) MAB 
problem, which aims to maximize the accumulated reward using
knowledge learned from the outcome of past plays 0. The 
problem falls in the category of "learning through doing". 

The centralized multiuser version of the MAB problem was 
considered in as a single user MAB problem but with 
the possibility of simultaneously playing multiple arms. The 
decentralized MAB problem was addressed explicitly in 
and in the context of cognitive radio systems in 0, 0. 
Typically, learning in the MAB problem refers to learning 
which arms are more favorable to play. The regret of the order 
optimal distributed learning with respect to the oracle player 
often increases with the number of plays as O(log n), unlike
that in 0. 

The problem considered in this paper does not belong 
to the category of MAB problem although it shares some 
common characteristics with the MAB formulation. A key 
difference is that the parameters of the underlying random
processes are known here. Thus there is little ambiguity on
which channels are favorable for transmissions. Learning in 
this context deals with searching for appropriate channels to 
transmit, not knowing (for certain) the presence of other users. 

The uncertainty associated with the presence of other users
is a key distinction between the problem treated here and
the MAB formulation. For the multiuser MAB problem, the
presence of other players is certain whenever two players
play the same arm. In our case, a failed transmission may be
caused either by a collision or by the channel being off.

A related problem is learning parameters of multiple independent
processes when a user can choose where and when to
observe a particular process [8]-[10]. Without actively engaging
with other users, such formulations are more akin to
classical parameter estimation problems than to the "learning
through action" studied in our formulation and in the MAB formulations.

We commented earlier that the problem studied here can be 
viewed through the lens of multichannel random access where 
the fundamental problem of maximum achievable throughput 
region is unknown for most cases. The case when N = 1
has been studied extensively, and the problem of characterizing
the maximum throughput remains open. Generalizations to
N > 1 cases often deal with specific access policies such as
ALOHA.

II. System Model and Assumptions 

The multiaccess system considered includes N channels, K 
independent users, and a basestation. We specify their roles in 
their interactions and assumptions made in this paper. 

A. Channel Model 

The N channels are slotted and statistically independent. We 
consider a slot atomic, which means that it cannot be divided 
further so that multiple actions can be taken within one slot. 
The channel state of each channel in a slot is either "on" or 
"off" with "on" indicating that the channel can be used for 
transmission and "off" otherwise. The state of each channel is 
therefore a discrete-time binary process for which we model 
it as a renewal sequence alternating between consecutive "on" 
and "off" periods. 

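Let $F_i^{\mathrm{on}}$ be the distribution of the on period of channel i and
$F_i^{\mathrm{off}}$ its off period counterpart. We assume that $F_i^{\mathrm{on}}$ and $F_i^{\mathrm{off}}$
have finite means $\mu_i^{\mathrm{on}}$ and $\mu_i^{\mathrm{off}}$, and finite standard deviations
$\sigma_i^{\mathrm{on}}$ and $\sigma_i^{\mathrm{off}}$, respectively. Denote the long term fraction of on
periods of channel i by $\eta_i = \mu_i^{\mathrm{on}} / (\mu_i^{\mathrm{on}} + \mu_i^{\mathrm{off}})$.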
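As a quick check of this definition, the following sketch computes the long term on fraction from the mean on and off period lengths. The function name `on_fraction` is hypothetical; the two parameter pairs are the ones used in the numerical results of Section V.

```python
def on_fraction(mean_on: float, mean_off: float) -> float:
    """Long term fraction of slots in which an alternating renewal
    channel is on, given the mean on and off period lengths (in slots)."""
    if mean_on <= 0 or mean_off <= 0:
        raise ValueError("mean period lengths must be positive")
    return mean_on / (mean_on + mean_off)

print(round(on_fraction(3.23, 1.43), 3))  # 0.693
print(round(on_fraction(3.23, 4.3), 3))   # 0.429
```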

B. User Action and Feedback 

Users act independently and persistently, each aiming to
achieve some fixed throughput. They do not have a synchronized
starting slot; they may enter or depart from the
system at any time.

At the beginning of each slot, a user decides either to access
a channel or to sense a particular channel, based on
the outcomes of its own past actions. If the user decides to
access channel i in a slot, it transmits a packet to
a basestation over channel i. If it decides to observe channel
i in a slot, it monitors channel i.

When a user transmits over a particular channel to the
basestation, it receives binary feedback at the end of the
slot from the basestation indicating whether the transmission
was successful. When the transmission is successful, we call the
channel over which the transmission occurred available,
which means that the channel is in the on state and no other
user transmits. The user receives a feedback symbol "a". The
transmission fails if the channel is in the off state or when
multiple transmissions occur at the same time. In this case, we
call the channel unavailable, and the user receives a feedback
symbol "ā". We note that the binary feedback does not specify
which type of failure occurred.

If a user decides to sense a particular channel in a slot,
it observes the channel and decides whether the channel
is available, i.e., whether a transmission would have been
successful had the user decided to transmit. The outcome of
the sensing action is again binary, with "a" indicating that the
channel is available and "ā" the opposite.

Note that the information obtained by a user in a slot 
through observation is identical to that through the feedback 
of a transmission. The difference is that there is a potential 
reward or damage caused by transmission. 

III. A Distributed Learning and Access Policy 

In this section, we present a learning and access policy 
referred to as alternating sensing and access (ASA) policy. 
We show later in Section IV that ASA achieves the same
throughput region as the optimal centralized scheme. 

The process of distributed orthogonalization is dynamic.
When two users collide, one or both may switch to
a different channel, and the switch may cause further collisions
with others. Because a user cannot be certain that a failed
transmission is caused by a collision, it may decide to switch
to a different channel when in fact the failed transmission
is caused by the channel being off. The key to the learning and
access policy presented here is to mix the actions of transmission and
observation to reduce collisions and to recover when collisions
occur. Here we have a case of dynamic learning where a
balance between exploitation and exploration must be struck.

A. ASA Policy State and State Transition 

Every user executes the same ASA policy independently.
The structure of ASA is illustrated in Fig. 1, where the policy
traverses three policy states: channel selection, sensing,
and access. We describe the function of ASA at each state as
we follow one user traversing the various states.

We focus on user i who has just entered the system, wishing
to communicate at the rate of $r_i$. User i starts at the channel
selection state knowing that there is a set of channels
$\mathcal{C}_i = \{k : \eta_k > r_i\}$ that can accommodate his rate of
communications. He selects randomly with equal probability
one of the channels as his initial candidate for access. With
that choice, he proceeds to the sensing state.




Fig. 1. State diagram of alternating sensing and access (ASA) policy 

At the sensing state, the user senses the channel for a period
of $L_s$ consecutive slots. At the end of the sensing period, a
hypothesis test is made to test the hypothesis that the channel is
unoccupied. If the user believes that the channel is unoccupied
(he may be wrong of course), he enters the access state. If, on
the other hand, the test result is that the channel is occupied
by another user, he flips a fair coin to further decide whether
he should search for opportunities in other channels, or still
enter the access state to show his presence to other users. If
a tail shows up, then the user returns to the channel selection
state (as described by "occupied and tail" in Fig. 1). There,
again, he chooses randomly another channel from $\mathcal{C}_i$ (with
replacement). Otherwise, if a head shows up, the user still
enters the access state to transmit and let other users be aware
of his presence (as described by "vacant or head" in Fig. 1).

At the access state, user i transmits with probability $q_i$ for
a period of $L_t$ consecutive slots, where $q_i$ is chosen to achieve
the desired throughput. At the end of each slot, user i receives
a feedback. At the end of the transmission period, a hypothesis
test is made to check if he has been colliding with another user.
If the test result is that another user is accessing the channel at
the same time (again he may be wrong of course), he returns
to the channel selection state. If, on the other hand, the user
believes that there is no competing user, he stays at the access
state for another transmission period that has $L'_t > L_t$ slots.

The detailed specification of ASA now reduces to finding
appropriate durations of the sensing and transmission periods and
constructing a detector for channel occupancy.
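The state transitions described above can be summarized in a few lines of code. This is a minimal sketch, not the authors' implementation: the names `State` and `next_state` are hypothetical, and the hypothesis test outcome is passed in as a boolean rather than computed.

```python
from enum import Enum
import random

class State(Enum):
    SELECT = "channel selection"
    SENSE = "sensing"
    ACCESS = "access"

def next_state(state: State, channel_looks_free: bool, rng: random.Random) -> State:
    """One ASA transition taken at the end of a detection period.

    channel_looks_free is the outcome of the hypothesis test: for SENSE it
    means 'unoccupied', for ACCESS it means 'no competing user detected'.
    It is ignored in SELECT, which always proceeds to sensing.
    """
    if state is State.SELECT:
        return State.SENSE
    if state is State.SENSE:
        if channel_looks_free:
            return State.ACCESS
        # Occupied: a fair coin decides between leaving (tail) and
        # transmitting anyway to show presence (head).
        heads = rng.random() < 0.5
        return State.ACCESS if heads else State.SELECT
    # ACCESS: stay if no collision was detected, otherwise reselect.
    return State.ACCESS if channel_looks_free else State.SELECT
```

For example, `next_state(State.SENSE, False, random.Random(0))` returns either `State.ACCESS` or `State.SELECT` depending on the coin flip, matching the "vacant or head" and "occupied and tail" branches of Fig. 1.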

B. Time Structure of ASA and Detection Period 

ASA alternates between sensing and access periods, punctuated
by detection actions. This structure is illustrated in Fig. 2,
where we refer to the time between the end of one detection and
the completion of the next detection as a detection period, during
which the user collects either feedback samples (if in the
access state) or observation samples (if in the sensing state)
before a test is performed at the end of the detection period.
The length of the kth detection period is denoted by $L_k$.

A key idea of ASA is to let $L_k$ be a monotonically
increasing function of k. Indeed, one can show that if $L_k$ does
not grow, ASA does not achieve the performance offered by
the optimal centralized scheme. Here we let $L_k$ form a linear
progression by

$$L_{k+1} = L_k + C \qquad (4)$$

for some C > 0.
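The linear progression in Eq. (4) is easy to tabulate. The sketch below lists the first few detection period lengths and their cumulative slot counts; the helper name `detection_lengths` is hypothetical, and the values 24 and 12 are the initial length and increment used in the numerical results of Section V.

```python
from itertools import accumulate

def detection_lengths(initial_length: int, increment: int, count: int) -> list[int]:
    """Lengths of the first `count` detection periods under Eq. (4)."""
    return [initial_length + k * increment for k in range(count)]

lengths = detection_lengths(24, 12, 5)
print(lengths)                    # [24, 36, 48, 60, 72]
print(list(accumulate(lengths)))  # [24, 60, 108, 168, 240]
```

Setting the increment to 0 gives the fixed length schedule that the increasing schedule is compared against.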

The significance of monotonically increasing the detection
period is twofold. First, with increasing $L_k$, detection accuracy
improves. We show later in Section IV that the detector used in
ASA has error probabilities that decay exponentially with respect
to $L_k$.

Second, the increasing $L_k$ provides a level of stability to
the policy. A user who finds the correct channel tends to stay 
there until completion; it is unlikely a new user can bump a 
settled user off its track. 

C. Channel Availability Test 

We say that the channel is available in a slot if the channel 
is on and no one transmits in that slot. We present here a 
detector that tests channel availability. 

Note that because the feedback from transmission and the
outcome of sensing give the same information, the detectors
used at the sensing and access states are identical. In both
cases, for a sensing or access period of L slots, the user obtains
a sequence of binary outcomes in $\{a, \bar{a}\}^L$, with a indicating that
the channel is available.

Let $L_a$ be the number of slots in which channel i is available.
The availability test is a threshold test on the sample mean of
the availability, i.e.,

$$\frac{L_a}{L} \;\underset{\text{unavailable}}{\overset{\text{available}}{\gtrless}}\; \eta_i - \epsilon \qquad (5)$$

where $\epsilon > 0$ is an arbitrarily small constant.

It is not difficult to see that, if there is a persistent user 
occupying channel i, the above detector detects correctly 
with high probability. On the other hand, if the channel is 
available, the probability of mistakenly detecting the channel 
as unavailable decays with L. 
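A minimal sketch of the threshold test in Eq. (5) follows. The function name `channel_available` is hypothetical, and declaring the channel available on an exact tie is an assumption, since the text does not specify tie handling.

```python
def channel_available(outcomes: list[bool], eta: float, eps: float) -> bool:
    """Threshold test of Eq. (5): declare the channel available when the
    fraction of available slots is at least eta - eps.

    outcomes[t] is True when slot t was observed (or fed back) as available.
    Treating a tie as 'available' is an assumption of this sketch.
    """
    if not outcomes:
        raise ValueError("need at least one slot")
    fraction_available = sum(outcomes) / len(outcomes)
    return fraction_available >= eta - eps
```

With $\eta_i = 0.693$ and $\epsilon = 0.05$, a period with 7 available slots out of 10 passes the test, while one with 3 out of 10 does not, as would be expected when a persistent user occupies the channel.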

When the underlying channel state processes are alternating 
renewal processes, we claim the following: 

Lemma 1: The error probabilities of the channel availability
detector given in (5) decay exponentially with L.

The proof of the above lemma is given in the Appendix.
Because the length $L_k$ of the kth detection period increases
linearly with k, the lemma also implies that the detection error
probabilities decay exponentially with k.



[Figure: timeline of successive detection periods labeled sensing, access, access, sensing, access.]

Fig. 2. Illustration of detection period and increasing detection period length



IV. Main Results 

We present in this section the main results and show that 
ASA achieves finite expected regret compared with the optimal 
centralized scheme. 

Define regret $\mathcal{R}_n$ as the difference in the total number of
successful transmissions (summed over all users) between the
centrally coordinated scenario with pre-determined channel
assignment and the distributed multiaccess scheme in the first
n detection periods. With this we state our main result on
expected regret.

Theorem 1: Let $\mathcal{M}$ be the maximum throughput region
achievable by a central controller for a K user multiaccess
system involving N independent alternating renewal on-off
channels. Then the expected regret for the ASA policy converges
to a finite value, i.e.,

$$E(\mathcal{R}_n) \to C \quad \text{as } n \to \infty.$$

Consequently, the throughput region achievable by the distributed
learning and access policy ASA is $\mathcal{S}_{\mathrm{ASA}} = \mathcal{M}$.

The proof of Theorem 1 is given in full in the Appendix.
Here we present a sketch that outlines main ideas behind the 
proof. 

When all of the K users are in the accessing state in separate
channels, no expected regret is incurred. Therefore the expected
regret is incurred solely in the undesirable configuration
in which some users are not in the accessing state or
not in a separate channel. To investigate this undesirable event,
the first ingredient we need is the exponential decay, in the
detection period index i, of $P_{i,e}$, the probability that some
users are not in the accessing mode or not in a separate channel
in the ith detection period. This quantifies the probability
of the undesirable event over the evolution of the policy, and
is given in Lemma 4. Lemma 3 serves as a stepping stone to
Lemma 4.

To capture the expected regret in the first n detection periods,
Lemma 2 provides an upper bound (14) for the expected
regret $\mathcal{R}_n$, which involves $P_{i,e}$ (decreasing with i) and $L_i$
(increasing with i). By Lemma 4, $P_{i,e}$ decays exponentially
fast, while according to the policy design, the detection period
length $L_i$ only increases linearly.

The fast decay of $P_{i,e}$ and the relatively slow growth of $L_i$
guarantee that the upper bound (14) sums to a finite value.
This completes the proof of Theorem 1. In the Appendix, we
list the required lemmas (Lemmas 2 to 4) and describe the
procedure to prove them.

Given the finite expected regret result, it is relatively
straightforward to show that the ASA policy achieves the same
throughput region as the centralized scheme, i.e., $\mathcal{S}_{\mathrm{ASA}} = \mathcal{M}$,
by dividing by the time horizon and taking the limit.
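The convergence argument in the proof sketch can be illustrated numerically. The code below sums the per-period bound (error probability times period length times N) under a purely illustrative assumption that the error probability equals exp(-rate * L_i); the decay rate 0.05 and the function name `regret_bound` are hypothetical and are not taken from the paper, while N = 6, the initial length 24, and the increment 12 follow Section V.

```python
import math

def regret_bound(num_periods: int, num_channels: int, l0: int, c: int, rate: float) -> float:
    """Partial sum of the bound sum_i P_i * L_i * N, assuming for
    illustration that P_i = exp(-rate * L_i) with L_i = l0 + (i - 1) * c."""
    total = 0.0
    for i in range(1, num_periods + 1):
        length = l0 + (i - 1) * c
        total += math.exp(-rate * length) * length * num_channels
    return total

for n in (5, 50, 500):
    print(n, regret_bound(n, num_channels=6, l0=24, c=12, rate=0.05))
```

The partial sums stop growing after a modest number of periods, because the exponential decay dominates the linear growth of the period length.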

V. Numerical Results 

We conduct numerical simulations for various channel and
user scenarios.

Fig. 3. Homogeneous channels. K = 2, 4, 6, N = 6

A. Simulation setup 

We adopt the geometric distribution for the channel on and
off period lengths. In the homogeneous channel case, the
average on and off period lengths are $\mu^{\mathrm{on}} = 3.23$ and
$\mu^{\mathrm{off}} = 1.43$ (the long term channel available fraction is $\eta = 0.693$).
In the heterogeneous channel case, half of the
channels have parameters $\mu^{\mathrm{on}} = 3.23$ and $\mu^{\mathrm{off}} = 1.43$
($\eta = 0.693$), and the other half have parameters
$\mu^{\mathrm{on}} = 3.23$ and $\mu^{\mathrm{off}} = 4.3$ ($\eta = 0.429$).
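A channel of this kind can be simulated in a few lines. The sketch below is an assumption-laden illustration rather than the authors' simulator: the helper names are hypothetical, period lengths are drawn from a geometric distribution on {1, 2, ...}, and the channel arbitrarily starts with an on period.

```python
import math
import random

def geometric(mean: float, rng: random.Random) -> int:
    """Sample a period length in {1, 2, ...} from a geometric
    distribution with the given mean (must exceed 1)."""
    p = 1.0 / mean
    u = 1.0 - rng.random()          # u lies in (0, 1]
    return int(math.log(u) / math.log(1.0 - p)) + 1

def simulate_channel(num_slots: int, mean_on: float, mean_off: float,
                     rng: random.Random) -> list[bool]:
    """Slot-by-slot on/off states of one alternating renewal channel,
    starting (arbitrarily) with an on period."""
    states: list[bool] = []
    is_on = True
    while len(states) < num_slots:
        length = geometric(mean_on if is_on else mean_off, rng)
        states.extend([is_on] * length)
        is_on = not is_on
    return states[:num_slots]

rng = random.Random(0)
states = simulate_channel(200_000, 3.23, 1.43, rng)
print(sum(states) / len(states))
```

Over a long horizon the printed on fraction should be close to the stated value of 0.693 for the homogeneous parameters.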

The initial detection period length is $L_0 = 24$ slots, incremented
each time by C = 12 slots. The entire horizon is taken
to be 5000 slots, and the number of Monte Carlo runs is 20.

B. Increasing the Number of Users K 

We simulate the effect of increasing the number of users with
N = 6 and K = 2, 4, 6, and the simulated regret is depicted
in Fig. 3. All the users have targeted individual throughput
r = 0.5.

As predicted by Theorem 1, the simulated expected regret
levels off eventually, consistent with the result of finite
expected regret. The three curves for K = 2, 4, 6 in Fig. 3
clearly show that both the expected regret and
the time it takes for the expected regret to converge increase
with K. This trend is quite intuitive; when there are more
users, the entire process takes much longer.

C. Fixed vs Increasing Detection Period Length 

To compare the impact of fixed and increasing detection
period lengths, we simulate the situation with initial detection
period length 24 and increments of 12 slots (increasing) and
0 slots (fixed) each time. The simulated expected regrets are
shown in Figs. 4 and 5 for N = K = 4 with fixed and increasing
detection period length.

The expected regret with increasing detection period length
is lower than its counterpart with fixed detection
period length. This comparison illustrates the necessity of
the increasing detection period length structure for the desired
finite expected regret, and justifies the rationale of establishing
the exponential decay of the error probability in detection.

VI. Conclusion 

We consider in this paper the problem of distributed learning
and multiaccess of orthogonal channels. We have shown that
perfect orthogonalization can be achieved by a distributed and
asynchronous learning and access policy in the sense that the
throughput region of a centralized scheme can be achieved.
In fact, we have established that the expected regret of the
proposed distributed scheme with respect to a centralized
scheme is finite.

Fig. 4. Homogeneous channels. N = K = 4.

Fig. 5. Heterogeneous channels. N = K = 4.

Appendix 

A. Proof of Lemma 1 

Proposition 1: (Central limit theorem for counting processes
of renewal processes) For a renewal process with i.i.d.
inter renewal lengths $T_1, T_2, \ldots, T_n, \ldots$, the counting process
$S_n$ is defined as

$$S_n = \max\Bigl\{i : \sum_{j=1}^{i} T_j \le n\Bigr\}. \qquad (6)$$

The counting process $S_n$ is asymptotically normal as n
approaches infinity. Specifically, suppose that the mean $\mu$ and
variance $\sigma^2$ of the length of the inter renewal period T are
finite, and let

$$Z_n = \frac{S_n - n/\mu}{\sigma\sqrt{n/\mu^3}}.$$

Then the distribution of $Z_n$ converges to the standard normal
distribution as $n \to \infty$.
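The normalization in Proposition 1 can be sanity checked by simulation. The sketch below uses a hypothetical inter renewal distribution (uniform on {1, ..., 8}, chosen only because its mean and variance are easy to state) and compares the empirical mean and standard deviation of $Z_n$ with 0 and 1; all function names are assumptions of this example.

```python
import math
import random
import statistics

def count_renewals(horizon: int, draw_length, rng: random.Random) -> int:
    """S_n of Eq. (6): the largest i with T_1 + ... + T_i <= horizon."""
    elapsed, count = 0, 0
    while True:
        elapsed += draw_length(rng)
        if elapsed > horizon:
            return count
        count += 1

def draw_uniform_length(rng: random.Random) -> int:
    """Hypothetical inter renewal length: uniform on {1, ..., 8}."""
    return rng.randint(1, 8)

mu = 4.5                       # mean of uniform{1..8}
sigma = math.sqrt(63 / 12)     # variance of uniform{1..8} is (8^2 - 1) / 12
n = 2000
rng = random.Random(1)
z = [(count_renewals(n, draw_uniform_length, rng) - n / mu)
     / (sigma * math.sqrt(n / mu ** 3)) for _ in range(1000)]
print(statistics.mean(z), statistics.pstdev(z))
```

The empirical mean should be near 0 and the empirical standard deviation near 1, up to sampling noise and a small finite-horizon bias.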

With the central limit theorem for the counting
process in hand, we are ready to prove Lemma 1. Specifically, we
upper and lower bound the detection statistic $L_a/L$, and
then show that both bounds have exponentially
decaying probability of deviating from their common mean, thus
showing that the detection statistic $L_a/L$ has exponentially
decaying tail probability on both sides of its mean ($\eta_i$
under $H_0$ and $\eta_i(1-\lambda)$ under $H_1$, where $\lambda$ is the transmission
probability of the user(s) already accessing the channel).

Proof: We first upper and lower bound the detection
statistic $L_a/L$ in Eq. (5):

$$\frac{1}{n}\sum_{i=1}^{S_n} X_i \;\le\; \frac{L_a(n)}{n} \;\le\; \frac{1}{n}\sum_{i=1}^{S_n+1} X_i, \qquad (7)$$

where $X_i$ is the number of available slots (channel on
and no other user transmitting) experienced by the user in the
ith on period of the channel, and $S_n$ is the counting process
of the renewal process whose inter renewal block is
a pair of consecutive on and off periods.
By the results (8) and (9) in renewal theory,

$$\lim_{n\to\infty} \frac{S_n}{n} = \frac{1}{\mu}, \qquad (8)$$

$$\lim_{n\to\infty} S_n = \infty, \qquad (9)$$

it is easy to show that as n approaches infinity, the three terms
in Eq. (7) have identical mean: $\eta_i$ under $H_0$, and $\eta_i(1-\lambda)$ under
$H_1$.

We have to show that $L_a(n)/n$ converges to its mean with
exponentially decaying tail probability under both hypotheses.
This can be done if we can show that the leftmost and rightmost
sides of Eq. (7) converge exponentially fast to their expected
values.

We treat the rightmost side of Eq. (7); the procedure
is similar for the leftmost side. Specifically, rewrite the
rightmost side of Eq. (7) as

$$\frac{1}{n}\sum_{i=1}^{S_n+1} X_i = \frac{\sum_{i=1}^{S_n+1} X_i}{S_n+1}\cdot\frac{S_n+1}{S_n}\cdot\frac{S_n}{n}, \qquad (10)$$

and we will show that the three terms $\frac{\sum_{i=1}^{S_n+1} X_i}{S_n+1}$, $\frac{S_n+1}{S_n}$, and $\frac{S_n}{n}$
converge exponentially fast to their expected values EX, 1, and
$1/\mu$, respectively.

Due to the nature of the alternating renewal channel process
and the probabilistic transmissions, the claim for the term
$\frac{\sum_{i=1}^{S_n+1} X_i}{S_n+1}$ follows from the standard large deviation result for
i.i.d. sums, and Eq. (9).

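The claim for the term $\frac{S_n+1}{S_n}$ follows from Proposition 1.
Specifically,

$$P\Bigl(\frac{S_n+1}{S_n} > 1+\epsilon\Bigr) = P\Bigl(S_n < \frac{1}{\epsilon}\Bigr) = P\Bigl(\frac{S_n - n/\mu}{\sigma\sqrt{n/\mu^3}} < \frac{1/\epsilon - n/\mu}{\sigma\sqrt{n/\mu^3}}\Bigr). \qquad (11)$$

Therefore $P\bigl(\frac{S_n+1}{S_n} > 1+\epsilon\bigr)$ decays with exponent (up to a
constant factor) $\frac{(1/\epsilon - n/\mu)^2}{\sigma^2 n/\mu^3}$, which has order n.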



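Similarly, the claim for the term $\frac{S_n}{n}$ follows by
applying Proposition 1. Specifically,

$$P\Bigl(\frac{S_n}{n} > \frac{1}{\mu}+\epsilon\Bigr) = P\Bigl(S_n > \frac{n}{\mu} + n\epsilon\Bigr) \qquad (12)$$

$$= P\Bigl(\frac{S_n - n/\mu}{\sigma\sqrt{n/\mu^3}} > \frac{n\epsilon}{\sigma\sqrt{n/\mu^3}}\Bigr). \qquad (13)$$

Therefore $P\bigl(\frac{S_n}{n} > \frac{1}{\mu}+\epsilon\bigr)$ decays with exponent (up to a
constant factor) $\frac{n\epsilon^2}{\sigma^2/\mu^3}$, which has order n.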



Note that the quantities $\mu$ and $\sigma^2$ in Eqs. (11), (12), and (13)
are the mean and variance of the inter renewal block length,
where a block is a pair of consecutive on and off
periods.

We can show the other side of the probability inequalities,
for $P\bigl(\frac{S_n+1}{S_n} < 1-\epsilon\bigr)$ and $P\bigl(\frac{S_n}{n} < \frac{1}{\mu}-\epsilon\bigr)$, in the same way.
Therefore we have established the exponential decay of the
tail probability of the three terms in Eq. (10). This leads to
the exponential decay of the tail probability of the detection
statistic $L_a/L$ in the detection period length n
(also denoted L).

By the threshold structure of the detection, we conclude
that the miss detection and false alarm probabilities decay
exponentially with respect to the detection period length n. ■

B. Lemmas for Theorem 1

We would like to show a finite expected regret $\mathcal{R}_n$ between
the ASA policy and that with central coordination, to prove
Theorem 1. As discussed earlier, the regret $\mathcal{R}_n$ will be small
if the fraction of time the users spend in the accessing mode
with orthogonalized channel occupancy is high. Indeed, this
relationship is formalized in Lemma 2, showing that the
expected regret $\mathcal{R}_n$ is always upper bounded by (14).

Lemma 2:

$$E(\mathcal{R}_n) \le \sum_{i=1}^{n} P_{i,e} \times L_i \times N. \qquad (14)$$

Proof: We break down the regret according to the detection
periods. In the ith detection period, if the users are
orthogonal and all in access mode, then no expected
regret is incurred. Otherwise, the regret in the ith detection
period can be at most the total number of slots
contained in the N channels in this detection period, which is
exactly $L_i \times N$. Therefore the expected regret incurred in
the ith detection period is at most $P_{i,e} \times L_i \times N$. Summing
over all detection periods from 1 to n yields the desired formula
(14). ■

From Eq. (14), the expected regret will be finite if $P_{i,e}$, the
probability that some users are not in the accessing mode
or not in a separate channel, decays fast enough compared with
the growth of $L_i$.

The factors that drive the decay rate of $P_{i,e}$ include the
decay rate of the detection error (how accurate the inference is)
and the transition rule's ability to adjust and separate when
collision happens (whether the distributed transition rule is
indeed leading users to separate gradually). More rigorously,
Lemma 3 shows the quantitative relationship between these
two drivers and $P_{i,e}$.

Lemma 3: The following recursion in the detection period
index i holds for $P_{i,e}$:

$$P_{i+3,e} \le 3N P_{i,fm} + P_{i,e}\Bigl(1 - \cdots\prod_{k=1}^{K}\cdots\Bigr), \qquad (15)$$

where $P_{i,fm}$ is the sum of the miss detection probability and
the false alarm probability with detection period length $L_i$.

Proof: The proof of Lemma 3 involves three parts. The
first part shows that it is always possible that the configuration 
of the N users will be corrected in at most three detection 
periods, if the configuration of the current detection period is 
not orthogonal accessing, and the detection results within the 
three detection periods are all correct. This part verifies that 
the transition rule adopted is indeed capable of adjusting and 
separating the users when collision happens. 

The second part shows that, provided that the associated
inferences are all correct, the probability of the correction
within three detection periods is always larger than
the product term over k = 1, ..., K that is subtracted from 1 in
Eq. (15), where $N_k$ is the number of qualified channels
for user k, and $N_k$ increases with k. This part verifies that
the random channel selection in the transition rule is making
strict progress gradually.

The third part verifies Eq. (15) by analyzing the events of
detection error and configuration error. 

We start by showing the first part by enumerating the 
possible undesirable configurations. 

1) Some user are still observing an vacant channel in the 
ith detection period. In this case, the user will correctly 
identify the opportunity and in the (i + l)th detection 
period transition to the accessing mode. 

2) Several users are observing the same channel in the $i$th
detection period. In this case, each user will correctly
identify the vacancy and in the $(i+1)$th detection period
transition to the accessing mode. However, this will lead
to multiple users transmitting in one channel, which will
be correctly detected. Therefore in the $(i+2)$th detection
period the users will randomly select channels. With
positive (lucky) probability, the selected channels will
be vacant and orthogonal, and in the $(i+3)$th detection
period the users will transition to the accessing mode.

3) Several users are accessing the same channel in the $i$th
detection period. In this case, each user will correctly
identify the collision among users and in the $(i+1)$th
detection period transition to the sensing mode. Therefore
in the $(i+1)$th detection period the users will randomly
select channels. Still with positive (lucky) probability,
the selected channels will be vacant and orthogonal, and
in the $(i+2)$th detection period the users will transition
to the accessing mode.

4) Some user is still observing an occupied channel in
the $i$th detection period. In this case, there are two
scenarios to analyze: (1) there is another vacant channel
qualified for the user; (2) there is currently no vacant
channel qualified for the user, i.e., all qualified channels
for the user are currently occupied. For scenario 1, the
user will correctly identify the unavailability, flip
a coin with tail outcome (probability 1/2), and in
the $(i+1)$th detection period randomly select another
qualified channel. With positive (lucky) probability, the
selected channel will be vacant, and in the $(i+2)$th
detection period the user will transition to the accessing
mode. For scenario 2, the user will correctly identify
the unavailability, flip a coin with head outcome
(probability 1/2), and in the $(i+1)$th detection period
start accessing the channel. This will lead to a collision
in this channel, and in the $(i+2)$th detection period the
users will evacuate from the channel and randomly select
channels to sense. At this time, with positive (lucky)
probability, the selected channels will be vacant and
qualified for the users involved in the collision, and in
the $(i+3)$th detection period the users will transition to
the accessing mode in separate channels.

Thus we have shown that if the channel configuration of the
current detection period is not orthogonal accessing, there is
always a positive probability that the configuration is corrected
in at most three steps, provided the detection is correct in the
$i$th, $(i+1)$th and $(i+2)$th detection periods.
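
The four cases above can be read as a per-user transition rule. The following Python sketch encodes that reading; it is an illustration under assumptions, not the formal definition of the rule. The function name `next_state`, the mode labels, the `detected_busy` flag, and the fallback used when a user has only one qualified channel are all hypothetical.

```python
import random

def next_state(mode, channel, detected_busy, qualified, rng):
    """Return (mode, channel) for the next detection period.

    mode          -- "sense" or "access" (hypothetical labels)
    channel       -- index of the channel the user is currently on
    detected_busy -- sense mode: channel judged occupied;
                     access mode: collision judged to have occurred
    qualified     -- list of channels this user may use (assumed given)
    rng           -- random.Random instance supplying the randomness
    """
    if mode == "access":
        if detected_busy:
            # Cases 2 and 3: collision, so leave and sense a random channel.
            return "sense", rng.choice(qualified)
        return "access", channel  # orthogonal access: stay put
    if not detected_busy:
        return "access", channel  # case 1: vacant channel, start accessing
    # Case 4: occupied channel, a fair coin decides what happens next.
    if rng.random() < 0.5:
        return "access", channel  # head: access the channel anyway
    others = [c for c in qualified if c != channel]
    # Tail: sense another qualified channel (assumed fallback: any channel).
    return "sense", rng.choice(others or qualified)

# Usage: a user sensing occupied channel 1 with qualified channels 0, 1, 2.
rng = random.Random(0)
print(next_state("sense", 1, True, [0, 1, 2], rng))
```

All detection outcomes are taken as correct here, which matches the conditioning used in the first part of the proof.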

The second part can be shown by inspecting the required
lucky probability. The randomness comes from the users'
distributed channel selection and from the fair coin flips
in the sensing state. With probability at least
$\frac{1}{2^K}\prod_{k=1}^{K}\frac{N_k - k + 1}{N_k}$, the channels can
be orthogonalized by the distributed random channel selection.
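
As a numerical illustration, the sketch below evaluates a bound of the form $\frac{1}{2^K}\prod_{k=1}^{K}\frac{N_k - k + 1}{N_k}$ and compares the product term with the exact probability that independent uniform picks land on distinct channels. The function names and the example qualified sets are hypothetical, and the $1/2^K$ coin-flip factor is taken as an assumption about the form of the bound.

```python
from fractions import Fraction
from itertools import product

def lucky_lower_bound(sizes):
    """(1/2^K) * prod_k (N_k - k + 1) / N_k for sizes = [N_1, ..., N_K]."""
    bound = Fraction(1, 2 ** len(sizes))  # assumed fair-coin factor
    for k, n_k in enumerate(sizes, start=1):
        bound *= Fraction(n_k - k + 1, n_k)
    return bound

def exact_orthogonal_probability(qualified_sets):
    """Probability that independent uniform picks are all distinct."""
    picks = list(product(*qualified_sets))  # every joint selection
    distinct = sum(1 for p in picks if len(set(p)) == len(p))
    return Fraction(distinct, len(picks))

# Hypothetical example: K = 3 users with 2, 3 and 4 qualified channels.
sets = [[0, 1], [0, 1, 2], [0, 1, 2, 3]]
bound = lucky_lower_bound([len(s) for s in sets])
exact = exact_orthogonal_probability(sets)
print(bound, exact)  # 1/24 1/3
```

In this example the product term equals the exact probability of a collision-free selection (1/3), and the coin-flip factor 1/8 brings the bound to 1/24.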

The third part involves analyzing events and algebra. Specifically,

$$\begin{aligned}
P_{i+3,e} &\le \mathbb{P}(\mathcal{A}_1) + \mathbb{P}(\mathcal{A}_2) && (16)\\
&\le 3N P_{i,f,m} + \mathbb{P}(\mathcal{A}_3) && (17)\\
&\le 3N P_{i,f,m} + P_{i,e}\left(1 - \frac{1}{2^K}\prod_{k=1}^{K}\frac{N_k - k + 1}{N_k}\right) && (18)
\end{aligned}$$

where $\mathcal{A}_1$ corresponds to the event that at least one user
makes a detection error (either miss detection or false alarm, in
either mode) in the $i$th, $(i+1)$th or $(i+2)$th detection periods;
$\mathcal{A}_2$ corresponds to the event that all detections made by
all users are correct in the $i$th, $(i+1)$th and $(i+2)$th detection
periods, and the configuration in the $(i+3)$th detection period is
still undesirable; $\mathcal{A}_3$ corresponds to the event that the
configuration in the $i$th detection period is undesirable, and the
random channel selection is not able to separate the users (unlucky);
and $P_{i,f,m}$ is the probability that either miss detection or false
alarm occurs at one user with detection period length $L_i$.

Specifically, the union bound and the fact that
$P_{i+1,f,m} \le P_{i,f,m}$ yield
$\mathbb{P}(\mathcal{A}_1) \le 3N P_{i,f,m}$. Event $\mathcal{A}_2$ is a
subset of event $\mathcal{A}_3$, since provided that all detections
made by all users are correct in the $i$th, $(i+1)$th and $(i+2)$th
detection periods, if either the configuration in the $i$th detection
period is desirable, or the random channel selection is able
to separate the users (lucky), then the configuration in the
$(i+3)$th detection period has to be desirable. This yields
$\mathbb{P}(\mathcal{A}_2) \le \mathbb{P}(\mathcal{A}_3)$. Finally, the
probability that the random channel selection is able to separate the
users (lucky) is lower bounded by
$\frac{1}{2^K}\prod_{k=1}^{K}\frac{N_k - k + 1}{N_k}$. Therefore the
probability that the random channel selection is not able to separate
the users (unlucky) is upper bounded by
$1 - \frac{1}{2^K}\prod_{k=1}^{K}\frac{N_k - k + 1}{N_k}$, and

$$\mathbb{P}(\mathcal{A}_3) \le P_{i,e}\left(1 - \frac{1}{2^K}\prod_{k=1}^{K}\frac{N_k - k + 1}{N_k}\right). \qquad \blacksquare$$



With Eq. (15), we are in a position to derive the exponential
decay of $P_{i,e}$, as established in Lemma 4.

Lemma 4: The probability that the system is not in "good
configuration" in the $i$th detection period, $P_{i,e}$, decays
exponentially in the detection period index $i$.

Proof: The miss detection probability $P_{i,m}$ and the false
alarm probability $P_{i,f}$ decay exponentially with respect
to the detection period length $L_i$, which further implies
the exponential decay of the quantity $P_{i,f,m}$. Therefore one
concludes from the recursion (15) that the exponential
decay of $P_{i,e}$ in the detection period index $i$ is guaranteed. ■
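
To make the argument concrete, the sketch below iterates a recursion of the form $P_{i+3,e} \le 3N P_{i,f,m} + (1-q) P_{i,e}$ with equality, where $q$ stands for the lucky-probability lower bound. All numbers are hypothetical: the detection error is assumed to be $0.05 \cdot 0.8^{i}$, and $q = 1/24$, $N = 4$ are chosen only for illustration.

```python
def iterate_recursion(p0, q, n_channels, detection_error, steps):
    """Iterate P_{i+3,e} = 3 N P_{i,f,m} + (1 - q) P_{i,e}.

    Returns the bound at detection periods i = 0, 3, 6, ...,
    clipped at 1 because each entry bounds a probability.
    """
    values = [p0]
    for j in range(steps):
        i = 3 * j  # the recursion advances three detection periods at a time
        nxt = 3 * n_channels * detection_error(i) + (1 - q) * values[-1]
        values.append(min(1.0, nxt))
    return values

# Hypothetical parameters: geometric detection error, q = 1/24, N = 4.
bounds = iterate_recursion(
    p0=1.0,
    q=1 / 24,
    n_channels=4,
    detection_error=lambda i: 0.05 * 0.8 ** i,
    steps=200,
)
print(bounds[-1], bounds[-1] / bounds[-2])
```

Under these assumptions the detection-error term dies out faster than the contraction factor $1 - q$, so the ratio of successive bounds settles at $1 - q$ and the bound decays geometrically in $i$.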